1 Background

The outbreak of the novel Corona virus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 112 million cases have been confirmed worldwide, with nearly 2.5 million associated deaths. Within the US alone, there have been over 500,000 deaths and upwards of 28 million cases reported. Governments around the world have implemented and suggested a number of policies to lessen the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The global pandemic has impacted the lives of individuals in countless ways, and though many countries have begun vaccinating individuals, the long-term impact of the virus remains unclear.

The impact of COVID-19 on a given segment of the population appears to vary drastically based on the socioeconomic characteristics of the segment. In particular, differing rates of infection and fatalities have been reported among different racial groups, age groups, and socioeconomic groups. One of the most important metrics for determining the impact of the pandemic is the death rate, which is the proportion of people within the total population that die due to the the disease.

We assemble this dataset for our research with the goal to investigate the effectiveness of lockdown on flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.

There are two main goals for this case study.

  1. We show the dynamic evolvement of COVID cases and COVID-related death at state level.
  2. We try to figure out what county-level demographic and policy interventions are associated with mortality rate in the US. We try to construct models to find possible factors related to county-level COVID-19 mortality rates.

Remark: please keep track with the most updated version of this write-up.

2 Data Summary

The data comes from several different sources:

  1. County-level infection and fatality data - This dataset gives daily cumulative numbers on infection and fatality for each county.
  2. County-level socioeconomic data - The following are the four relevant datasets from this site.
    1. Income - Poverty level and household income.
    2. Jobs - Employment type, rate, and change.
    3. People - Population size, density, education level, race, age, household size, and migration rates.
    4. County Classifications - Type of county (rural or urban on a rural-urban continuum scale).
  3. Intervention Policy Data - This dataset is a manually compiled list of the dates that interventions/lockdown policies were implemented and lifted at the county level.

3 EDA

In this case study, we use the following three cleaned data:

  • covid_county.csv: County-level socialeconomic information that combines the above-mentioned 4 datasets: Income (Poverty level and household income), Jobs (Employment type, rate, and change), People (Population size, density, education level, race, age, household size, and migration rates), County Classifications
  • covid_rates.csv: Daily cumulative numbers on infection and fatality for each county
  • covid_intervention.csv: County-level lockdown intervention.

Among all data, the unique identifier of county is FIPS.

The cleaning procedure is attached in Appendix 2: Data cleaning You may go through it if you are interested or would like to make any changes.

First read in the data.

# county-level socialeconomic information
county_data <- fread("data/covid_county.csv") 
# county-level COVID case and death
covid_rate <- fread("data/covid_rates.csv")
# county-level lockdown dates 
covid_intervention <- fread("data/covid_intervention.csv")

3.1 Understand the data

The detailed description of variables is in Appendix 1: Data description. Please get familiar with the variables. Summarize the two data briefly.

head(covid_intervention)
head(county_data)
head(covid_rate)
summary(covid_rate)
The three datasets are "county_data", "covid_intervention" and "covid_rate". The first two share in common that each ovservation is a US county. County_data is an extremely wide dataset, it has 205 variables describing each county's socioeconomic demographics. Most variables have numeric values, but there are a few categorical variables describing the county's Type (e.g. Type_2015_Recreation_NO - whether the county mainly relies on recreation industry; RuralUrbanContinuumCode2013 - where the county lands on the Rural-urban continuum). Covid_intervention describes each county's measures towards covid. The variables are dates on which the county stopped and resumed "gathering", "public schools", "travel ban", "entertainment" and etc.
The third dataset, covid_rate, is extremely long. Because each observation is one day, and there is cumulative case total and death total for all the counties in all the US states from January 2020 to Feburary 2021 (for some counties, there is only data available for part of the timecourse). There's also an estimate of total population of that county in 2019.

3.2 COVID case trend

It is crucial to decide the right granularity for visualization and analysis. We will compare daily vs weekly total new cases by state and we will see it is hard to interpret daily report.

  1. Plot new COVID cases in NY, WA and FL by state and by day. Any irregular pattern? What is the biggest problem of using single day data?
covid_rate_sub <- covid_rate %>% filter(State=="New York" | State == "Washington" | State == "Florida") 
day_state <- covid_rate_sub %>% group_by(State,date) %>% summarize(cum_cases_state = sum(cum_cases))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
As shown in the r-code above, we first created a subset of covid_rate containing only NY, WA and FL. In order to get a new column "new cases by day by state", we will first need to sum cumylative cases in all the counties in the state to get the cumulative cases of the state. We did that by group covid_rate by state and by date then get a summarized sum. 
diff_day<-rep(0,1112) #1112 is the length of daily_increment_State
daily_increment <- cbind(day_state, diff_day)
## New names:
## * NA -> ...4
split_by_state <- split(daily_increment, daily_increment$State)
split_by_state <- lapply(split_by_state, function(x) transform(x, diff_day= c(NA, diff(cum_cases_state))))
daily_increment <- do.call(rbind, split_by_state)
Then, in order to get the difference of cum_cases(cumulative total of cases) between consecutive days, we first created a column "diff_day", filled it with zeros and appended it to our long dataset daily_increment_State. We then splitted the long dataset by State into 3 shorter datasets about each State. For each of the 3 States, we filled the first grid of 'diff_day' column with N/A since we don't have data about the date prior to the first day listed, and the following grids with the difference between cum_cases  in this row and cum_cases in the previous row with the diff() function.  An important assumption is that, within data of each state, covid-rate is already sorted by date. Conveniently, this assumption is already satisfied. 
daily_increment_plot <- daily_increment %>%
  ggplot(aes(x = date, y = diff_day, col = State)) + 
  geom_line() + 
  geom_point() +
  theme_bw() +
ggtitle("Daily New COVID cases in NY, FL and WA")
ggplotly(daily_increment_plot +
           theme(legend.position = "none"))
The biggest difficulty to interpretation that the new cases by day graph presented is that the lines are not "smooth" - too many zigzagging. This means that this visualization might be presenting disturbances/noises that might distract our focus. For example, there is a big notable up-down for Florida between January 1st 2021 and January 2nd 2021. It is currently saying that there is 0 new cases on January 1st 2021 and over 30,000 new cases on January 2nd 2021. It is probably due to the fact that people didn't record the new cases on January 1st 2021, but that big jump is indeed disturbing.
A separate problem is that in the current analysis, we didn't take into account the population differences by the three states at all. So we wouldn't be able to tease apart how much of the smaller peak of Washington is due its smaller base population.
  1. Create weekly new cases per 100k weekly_case_per100k. Plot the spaghetti plots of weekly_case_per100k by state. Use TotalPopEst2019 as population.
week_state <- covid_rate %>% group_by(State,week)%>% mutate(cum_cases_week_state = sum(cum_cases))%>% ungroup() %>%
  group_by(State) %>% mutate(statePopEst2019 = sum(TotalPopEst2019))%>% ungroup() %>% 
  select(State,week, cum_cases_week_state, statePopEst2019) %>% distinct(State, week, .keep_all = TRUE)
diff_week<-rep(0,2518) #2518 is the length of week_state
weekly_increment <- cbind(week_state, diff_week)
split_by_state <- split(weekly_increment, weekly_increment$State)
split_by_state <- lapply(split_by_state, function(x) transform(x, diff_week= c(NA, diff(cum_cases_week_state))))
weekly_increment <- do.call(rbind, split_by_state)
weekly_increment <- weekly_increment %>% mutate(weekly_new_per100k = diff_week * 100000 /statePopEst2019 )
In order to create the variable "weekly new per 100k", we first get cumulative total of every week for every state. We also need to summarize the total population in 2019 of all the counties in a state to get the population of a state. Once we get weekly cumulative and population by state, we repeat the process that we used to obtain daily new cases above to get weekly new cases of very state. The new variable "weekly new per 100k" for every state would be weekly new cases divided by the state's total population then multiplying 100,000.   
weekly_increment_plot <- weekly_increment %>%
  ggplot(aes(x = week, y = weekly_new_per100k, group = State, col = State)) + 
  geom_line() + 
  geom_point() +
  theme_bw()+
  ggtitle("Weekly New COVID Cases per 100k")+
  theme(legend.position = "none")
weekly_increment_plot

We noticed that during earlier weeks of, there appeared to be negative weekly new cases, which is probably due to misreport. In order to more clearly examine the trend visually, we exclude all data point that showed negative weekly increase.
weekly_increment_excl <- weekly_increment %>% filter(weekly_new_per100k >= 0)
weekly_increment_plot_excl <- weekly_increment_excl %>%
  ggplot(aes(x = week, y = weekly_new_per100k, col = State)) + 
  geom_line() + 
  geom_point() +
  theme_bw()+
  ggtitle("Weekly New COVID Cases per 100k")
ggplotly(weekly_increment_plot_excl +
           theme(legend.position = "none"))
  1. Summarize the COVID case trend among states based on the plot in ii). What could be the possible reasons to explain the variabilities?
summary(covid_intervention)
##       FIPS          STATE            AREA_NAME          stay at home       
##  Min.   : 1000   Length:3272        Length:3272        Min.   :2020-03-18  
##  1st Qu.:19026   Class :character   Class :character   1st Qu.:2020-03-25  
##  Median :30022   Mode  :character   Mode  :character   Median :2020-03-29  
##  Mean   :31368                                         Mean   :2020-03-29  
##  3rd Qu.:46101                                         3rd Qu.:2020-04-03  
##  Max.   :72153                                         Max.   :2020-04-08  
##                                                        NA's   :576         
##  >50 gatherings       >500 gatherings      public schools      
##  Min.   :2020-03-16   Min.   :2020-03-12   Min.   :2020-03-16  
##  1st Qu.:2020-03-19   1st Qu.:2020-03-16   1st Qu.:2020-03-17  
##  Median :2020-03-22   Median :2020-03-19   Median :2020-03-18  
##  Mean   :2020-03-22   Mean   :2020-03-20   Mean   :2020-03-19  
##  3rd Qu.:2020-03-26   3rd Qu.:2020-03-26   3rd Qu.:2020-03-20  
##  Max.   :2020-04-03   Max.   :2020-04-03   Max.   :2020-04-03  
##  NA's   :200          NA's   :200                              
##  restaurant dine-in   entertainment/gym    federal guidelines  
##  Min.   :2020-03-13   Min.   :2020-03-13   Min.   :2020-03-17  
##  1st Qu.:2020-03-18   1st Qu.:2020-03-18   1st Qu.:2020-03-17  
##  Median :2020-03-20   Median :2020-03-20   Median :2020-03-17  
##  Mean   :2020-03-20   Mean   :2020-03-21   Mean   :2020-03-17  
##  3rd Qu.:2020-03-22   3rd Qu.:2020-03-24   3rd Qu.:2020-03-17  
##  Max.   :2020-04-05   Max.   :2020-04-07   Max.   :2020-03-17  
##                       NA's   :67                               
##  foreign travel ban   stay at home rollback >50 gatherings rollback
##  Min.   :2020-03-12   Min.   :2020-04-25    Min.   :2020-05-02     
##  1st Qu.:2020-03-12   1st Qu.:2020-05-05    1st Qu.:2020-05-16     
##  Median :2020-03-12   Median :2020-05-16    Median :2020-05-31     
##  Mean   :2020-03-12   Mean   :2020-05-17    Mean   :2020-05-29     
##  3rd Qu.:2020-03-12   3rd Qu.:2020-06-01    3rd Qu.:2020-06-13     
##  Max.   :2020-03-12   Max.   :2020-06-23    Max.   :2020-07-09     
##                       NA's   :604           NA's   :2020           
##  >500 gatherings rollback restaurant dine-in rollback
##  Min.   :2020-05-02       Min.   :2020-04-25         
##  1st Qu.:2020-05-17       1st Qu.:2020-05-05         
##  Median :2020-06-02       Median :2020-05-12         
##  Mean   :2020-05-31       Mean   :2020-05-15         
##  3rd Qu.:2020-06-13       3rd Qu.:2020-05-19         
##  Max.   :2020-07-09       Max.   :2020-07-09         
##  NA's   :2230             NA's   :424                
##  entertainment/gym rollback
##  Min.   :2020-04-25        
##  1st Qu.:2020-05-05        
##  Median :2020-05-12        
##  Mean   :2020-05-16        
##  3rd Qu.:2020-05-22        
##  Max.   :2020-07-09        
##  NA's   :564
The first pattern that attracts our attention is that there are variability of weekly new cases across time: for some period, there is growing number of new cases across weeks, whereas for some period the number of new cases are falling. And this pattern across time, is followed by every state. That is, there are there are apparently three peaks starting respectively around week 12 (April 5-11 2020), week 25 (July 5-11 2020) and week 43 (Nov 8- 14 2020). Peak is where the weekly increase culminate and start to decrease, so in order to find possible reasons to interpret the pattern, we mainly ask 1) what happend 2-3 weeks before the peak to accelerate the growth of new cases? 2) what happend right around the peak to make the curve start to wind down?
The first peak might be the natural culmination since the first detection of the pandemic without intervention. Through summarizing the first variables in covid_intervention, we would notice the mean date for the start of lockdown are all in the latter half of March. This probably explains why the first peak is in early April - as travel and interaction has been largely restricted since the first wave of lock down , weekly new cases begin to decrease.
However, weekly new cases began to be on the rise after it hit bottom in early June (week 21). Many factors may have contributed to this rise: after George Floyd's death on May 26, 2020, there was a peak of demonstrations and protests related to Black Life Matters in late May and early June in US. As shown in the following image from the report "Demonstrations & Political Violence in America: New Data for Summer 2020" by The Armed Conflict Location & Event Data Project (ACLED), in week of June 7, corresponding to our week 21, demonstrations related to Black Life Matters hit a peak, of more than 3000 movements in that week. On top of the series of heated civil movement, we can also learn from the summary of covid_intervention that most states issued policies of "lockdown rollback" in mid-late May. It also contributed to the rise of weekly new cases two weeks later.

“Demonstrations & Political Violence in America: New Data for Summer 2020, The Armed Conflict Location & Event Data Project (ACLED).”

Finally,the second peak passed and hit bottom in week 30, presumably as a consequence of another round of lockdown policies as well as subsided number of movement related to Black Life Matter. However, quickly it was on the rise again towards the biggest third peak. We don't have any lockdown data that could shed insights on the development of the third peak. However, we hypothesized that the large number of political gatherings leading to the presidential election contributed to the sharp growth. The fact that the third wave hit peak in the election week provided support for this hypothesis - since as the results settled, cases began to wind down. And we attributed the subsequent smaller peak on the way down in week 51 as resonance of winter holiday travel.
Other than the three-peak timecourse pattern shared by all the states, there is some variability among the states as well. For some states, the ups and downs in the curve are in really small scale, and in general, weekly new cases have been pretty low (e.g. Vermont, Maine, Washington). If we take a closer look, the leading states for the three peaks are different. New York is among the highes during the first peak. It makes sense because if the first peak is the culmination of the spreading of the cases through contact and travels without interventions - where else would experience more contact and travel than New York! Leading the second peak are Arizona and Florida, one hypothesis could be that hot weather is one contributing factor. South Dakoda, North Dakoda and Iowa are leading the third peak. Searching them in the covid_intervention dataset, we found that South Dakoda and North Dakoda both have NAs for fields "stay at home", "> 50 gathering" ">500 gatherings" and "entertainment", and Iowa had NAs for "stay at home". If NA means there was never policies spelling out the related lockdown order in these states, one hypothesis is that the lack of strict lockdown policies led to the observation that they are leading the third peak without being prevalent in the previous two waves.
  1. (Optional) Use covid_intervention to see whether the effectiveness of lockdown in flattening the curve.

3.3 COVID death trend

  1. For each month, plot the monthly deaths per 100k heatmap by state on US map. Use the same color range across months. (Hints: Set limits argument in scale_fill_gradient.)
pop <- county_data %>%
  group_by(State) %>%
  summarise(
    total.state.population = max(TotalPopEst2019,na.rm=TRUE),
    new.fips = min(FIPS)
  ) %>%
  rename(FIPS = new.fips)

pop <- merge(pop, county_data[,c(1,3)], sort=FALSE, by = "FIPS", all.x=TRUE)
pop <- pop %>%
  rename(Abbr = State) %>%
  rename(State = County)
head(pop)
##   FIPS Abbr total.state.population      State
## 1 2000   AK                 731545     Alaska
## 2 1000   AL                4903185    Alabama
## 3 5000   AR                3017804   Arkansas
## 4 4000   AZ                7278717    Arizona
## 5 6000   CA               39512223 California
## 6 8000   CO                5758736   Colorado
months <- c("2020-03-31", "2020-04-30", "2020-05-31", "2020-06-30", "2020-07-31", "2020-08-31", "2020-09-30", "2020-10-31", "2020-11-30", "2020-12-31", "2021-01-31", "2021-02-20")
months <- as.Date(months)

temp.covid_rate <- covid_rate %>%
  filter(date %in% months)

temp.covid_rate <- temp.covid_rate %>%
  group_by(State, date) %>%
  summarise(
    deaths = sum(cum_deaths)
  ) 
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
temp.covid_rate <- merge(temp.covid_rate, pop, by = "State", all.x = TRUE)
temp.covid_rate$mortality.rate <- (temp.covid_rate$deaths / temp.covid_rate$total.state.population) * 100000
data.s <- temp.covid_rate %>% rename(full=State, state=Abbr)
head(data.s)
##      full       date deaths FIPS state total.state.population mortality.rate
## 1 Alabama 2020-03-31     14 1000    AL                4903185          0.286
## 2 Alabama 2020-04-30    272 1000    AL                4903185          5.547
## 3 Alabama 2020-05-31    630 1000    AL                4903185         12.849
## 4 Alabama 2020-06-30    950 1000    AL                4903185         19.375
## 5 Alabama 2020-07-31   1580 1000    AL                4903185         32.224
## 6 Alabama 2020-08-31   2182 1000    AL                4903185         44.502
max_mort_col <- quantile(data.s$mortality.rate, 1, na.rm = T)
min_mort_col <- quantile(data.s$mortality.rate, 0, na.rm = T)


# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

monthly.plots <- list()
for (i in 1:12) {
  data.s.plot <- data.s %>% 
    filter(date == months[i]) %>%
    mutate(hover =  paste(state, '<br>', 
                        'mortality rate', round(mortality.rate, 2)))
  fig <- plot_geo(data.s.plot, locationmode = 'USA-states')
  fig <- fig %>% add_trace(
      z = ~mortality.rate, text = ~hover, locations = ~state,
      color = ~mortality.rate, colors = 'Blues',
      zmin = min_mort_col, zmax = max_mort_col
    )
  fig <- fig %>% colorbar(title = "Mortality rate of state")
  fig <- fig %>% layout(
      title = toString(c("State Moratlity Rate", toString(months[i]))),
      geo = g,
      hoverlabel = list(bgcolor="white")
    )
  
  monthly.plots[[i]] <- fig
}

ggplotly(monthly.plots[[1]])
ggplotly(monthly.plots[[2]])
ggplotly(monthly.plots[[3]])
ggplotly(monthly.plots[[4]])
ggplotly(monthly.plots[[5]])
ggplotly(monthly.plots[[6]])
ggplotly(monthly.plots[[7]])
ggplotly(monthly.plots[[8]])
ggplotly(monthly.plots[[9]])
ggplotly(monthly.plots[[10]])
ggplotly(monthly.plots[[11]])
ggplotly(monthly.plots[[12]])
  1. Use plotly to animate the monthly maps in i). Does it reveal any systematic way to capture the dynamic changes among states?
data.s$full <- tolower(data.s$full)
data.s <- data.s %>%
  rename(region=full)
data.s$center_lat <- data.s$state
data.s$center_long <- data.s$state

states <- map_data("state")
map <- merge(states, data.s, sort=FALSE, by = "region", all.x=TRUE)
map <- map[order(map$order), ]
head(map)
##    region  long  lat group order subregion       date deaths FIPS state
## 1 alabama -87.5 30.4     1     1      <NA> 2020-03-31     14 1000    AL
## 2 alabama -87.5 30.4     1     1      <NA> 2020-04-30    272 1000    AL
## 3 alabama -87.5 30.4     1     1      <NA> 2020-05-31    630 1000    AL
## 4 alabama -87.5 30.4     1     1      <NA> 2020-06-30    950 1000    AL
## 5 alabama -87.5 30.4     1     1      <NA> 2020-07-31   1580 1000    AL
## 6 alabama -87.5 30.4     1     1      <NA> 2020-08-31   2182 1000    AL
##   total.state.population mortality.rate center_lat center_long
## 1                4903185          0.286         AL          AL
## 2                4903185          5.547         AL          AL
## 3                4903185         12.849         AL          AL
## 4                4903185         19.375         AL          AL
## 5                4903185         32.224         AL          AL
## 6                4903185         44.502         AL          AL
data.s.plot <- data.s %>% 
  mutate(hover =  paste(state, '<br>', 
                        'mortality rate', round(mortality.rate, 2))) %>%
  mutate(date =  lubridate::month(date) )

data.s.plot$date <- ifelse(data.s.plot$date %in% c(1,2), data.s.plot$date + 10, data.s.plot$date - 2)

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

fig <-plot_geo(data.s.plot, locationmode = 'USA-states')
fig <- fig %>% add_trace(
    z = ~mortality.rate, text = ~hover, locations = ~state, frame= ~date,
    color = ~mortality.rate, colors = 'Blues', 
    zmin = min_mort_col, zmax = max_mort_col
  )
fig <- fig %>% colorbar(title = "Mortality rate of state")
fig <- fig %>% layout(
    title = 'Cumulative Mortality Rate of State: March 2020 - Feb 2021',
    geo = g,
    hoverlabel = list(bgcolor="white")
  ) %>%
  animation_slider(
    currentvalue = list(prefix = "Month: ")
  )

fig

4 COVID factor

We now try to build a good parsimonious model to find possible factors related to death rate on county level. Let us not take time series into account for the moment and use the total number as of Feb 1, 2021.

  1. Create the response variable total_death_per100k as the total of number of COVID deaths per 100k by Feb 1, 2021. We suggest to take log transformation as log_total_death_per100k = log(total_death_per100k + 1). Merge total_death_per100k to county_data for the following analysis.
covid_factor <- covid_rate[which(covid_rate$date == "2021-02-01")]

covid_factor$total_death_per100k <- (covid_factor$cum_deaths / covid_factor$TotalPopEst2019) * 100000
covid_factor$log_total_death_per100k <- log(covid_factor$total_death_per100k + 1)

covid_factor <- covid_factor[, c(1,5,6,7,10)]

covid_factor <- covid_factor %>% merge(county_data, by = "FIPS", all=TRUE)

length(unique(county_data$FIPS))
## [1] 3278
length(unique(covid_rate$FIPS))
## [1] 3108
length(unique(covid_factor$FIPS))
## [1] 3278
#all nas are either total state/country level information, low-population countys in alaska, or Puerto Rico
covid_factor[which(is.na(covid_factor$log_total_death_per100k)), c(1,5,6,7,18)]
##       FIPS log_total_death_per100k State           County TotalPopEst2019
##   1:     0                      NA    US    United States       328239523
##   2:  1000                      NA    AL          Alabama         4903185
##   3:  2000                      NA    AK           Alaska          731545
##   4:  2010                      NA    AK Aleutian Islands              NA
##   5:  2013                      NA    AK   Aleutians East            3337
##  ---                                                                     
## 167: 72147                      NA    PR          Vieques            8386
## 168: 72149                      NA    PR         Villalba           21372
## 169: 72151                      NA    PR          Yabucoa           32282
## 170: 72153                      NA    PR            Yauco           33575
## 171: 72153                      NA    PR            Yauco           33860
#remove all NAs
covid_factor <- covid_factor %>%
  filter(log_total_death_per100k != "NA")

head(covid_factor)
##    FIPS cum_cases cum_deaths week log_total_death_per100k State  County
## 1: 1001      5683         69   55                    4.82    AL Autauga
## 2: 1003     18211        224   55                    4.62    AL Baldwin
## 3: 1005      1956         40   55                    5.09    AL Barbour
## 4: 1007      2309         52   55                    5.45    AL    Bibb
## 5: 1009      5720        100   55                    5.16    AL  Blount
## 6: 1011      1089         29   55                    5.66    AL Bullock
##    MedHHInc PerCapitaInc PovertyUnder18Pct PovertyAllAgesPct Deep_Pov_All
## 1:    59338        29372              19.3              13.8         6.14
## 2:    57588        31203              13.9               9.8         4.48
## 3:    34382        18461              43.9              30.9        12.75
## 4:    46064        20199              27.8              21.8         5.90
## 5:    50412        22656              18.0              13.2         7.31
## 6:    29267        20346              68.3              42.5        17.66
##    Deep_Pov_Children PovertyUnder18Num PovertyAllAgesNum PopChangeRate1819
## 1:              8.91              2509              7587             0.605
## 2:              6.21              6442             21069             2.469
## 3:             26.71              2242              6788            -0.748
## 4:              9.26              1238              4400             0.121
## 5:             12.33              2374              7527             0.095
## 6:             44.64              1433              3610            -0.718
##    PopChangeRate1019 TotalPopEst2019 NetMigrationRate1019 NaturalChangeRate1019
## 1:             2.001           55869                0.686                 1.315
## 2:            21.911          223234               21.001                 0.910
## 3:            -9.664           24686               -8.797                -0.867
## 4:            -2.081           22394               -2.099                 0.017
## 5:             0.784           57826               -0.005                 0.790
## 6:            -7.126           10101               -7.769                 0.644
##    Net_International_Migration_Rate_2010_2019 PopChangeRate0010
## 1:                                     -0.029             24.96
## 2:                                      0.714             29.80
## 3:                                      0.161             -5.44
## 4:                                      0.525             10.03
## 5:                                      0.207             12.34
## 6:                                      0.699             -6.83
##    NetMigrationRate0010 NaturalChangeRate0010 Immigration_Rate_2000_2010
## 1:                11.87                  5.46                    -0.0102
## 2:                26.17                  3.32                     1.5845
## 3:                -4.80                  2.29                     1.8284
## 4:                 6.43                  2.10                     0.3415
## 5:                12.27                  3.14                     2.0770
## 6:                -9.13                  3.20                     3.6612
##    PopDensity2010 Under18Pct2010 Age65AndOlderPct2010 WhiteNonHispanicPct2010
## 1:           91.8           26.8                 12.0                    77.2
## 2:          114.7           23.0                 16.8                    83.5
## 3:           31.0           21.9                 14.2                    46.8
## 4:           36.8           22.7                 12.7                    75.0
## 5:           88.9           24.6                 14.7                    88.9
## 6:           17.5           22.3                 13.5                    21.9
##    BlackNonHispanicPct2010 AsianNonHispanicPct2010
## 1:                   17.58                    0.86
## 2:                    9.31                    0.74
## 3:                   46.69                    0.39
## 4:                   21.92                    0.10
## 5:                    1.26                    0.20
## 6:                   69.97                    0.18
##    NativeAmericanNonHispanicPct2010 HispanicPct2010 MultipleRacePct2010
## 1:                             0.40            2.40                1.59
## 2:                             0.63            4.38                1.49
## 3:                             0.22            5.05                0.94
## 4:                             0.28            1.77                0.89
## 5:                             0.50            8.07                1.19
## 6:                             0.18            7.12                0.79
##    NonHispanicWhitePopChangeRate0010 NonHispanicBlackPopChangeRate0010
## 1:                             21.05                             29.17
## 2:                             25.92                             18.17
## 3:                            -13.19                             -4.11
## 4:                              8.32                              9.60
## 5:                              8.41                             21.07
## 6:                            -13.46                            -10.00
##    NonHispanicAsianPopChangeRate0010 NonHispanicNativeAmericanPopChangeRate0010
## 1:                            140.72                                       16.7
## 2:                            152.35                                       52.2
## 3:                             28.92                                      -49.6
## 4:                             29.41                                       39.1
## 5:                             64.29                                       34.4
## 6:                             -4.76                                      -46.0
##    HispanicPopChangeRate0010 MultipleRacePopChangeRate0010
## 1:                     114.8                        114.57
## 2:                     224.1                         85.74
## 3:                     190.2                         21.70
## 4:                      93.3                         89.72
## 5:                      70.2                         31.79
## 6:                     141.3                          4.88
##    WhiteNonHispanicNum2010 BlackNonHispanicNum2010 AsianNonHispanicNum2010
## 1:                   42154                    9595                     467
## 2:                  152200                   16966                    1340
## 3:                   12837                   12820                     107
## 4:                   17191                    5024                      22
## 5:                   50952                     724                     115
## 6:                    2392                    7637                      20
##    NativeAmericanNonHispanicNum2010 HispanicNum2010 MultipleRaceNum2010
## 1:                              217            1310                 869
## 2:                             1146            7992                2723
## 3:                               60            1387                 258
## 4:                               64             406                 203
## 5:                              285            4626                 684
## 6:                               20             777                  86
##    ForeignBornPct ForeignBornEuropePct ForeignBornMexPct NonEnglishHHPct
## 1:          2.018               0.3351             0.346           0.654
## 2:          3.445               0.7717             0.820           1.425
## 3:          2.506               0.2405             1.299           1.470
## 4:          1.443               0.0355             0.169           0.687
## 5:          4.359               0.2897             3.131           1.461
## 6:          0.927               0.1739             0.000           1.053
##    Ed1LessThanHSPct Ed2HSDiplomaOnlyPct Ed3SomeCollegePct Ed4AssocDegreePct
## 1:            11.31                32.6              20.3              8.07
## 2:             9.74                27.6              22.0              9.36
## 3:            26.97                35.7              18.1              7.04
## 4:            16.79                47.3              18.6              5.75
## 5:            19.84                34.0              21.4             12.05
## 6:            24.77                39.7              17.0              5.31
##    Ed5CollegePlusPct AvgHHSize FemaleHHPct HH65PlusAlonePct OwnHomePct
## 1:              27.7      2.59       11.39             10.5       74.9
## 2:              31.3      2.61        9.52             12.8       73.6
## 3:              12.2      2.49       19.33             14.5       61.4
## 4:              11.5      2.99       13.65             12.9       75.1
## 5:              12.6      2.77        9.60             11.0       78.6
## 6:              13.3      2.75       24.47             10.8       75.5
##    ForeignBornNum TotalPopACS ForeignBornAfricaPct Ed3SomeCollegeNum
## 1:           1114       55200               0.0000              7554
## 2:           7170      208107               0.0913             32266
## 3:            646       25782               0.0000              3287
## 4:            325       22527               0.0000              2938
## 5:           2513       57645               0.1145              8492
## 6:             96       10352               0.1449              1205
##    Ed2HSDiplomaOnlyNum Ed1LessThanHSNum TotalPop25Plus Ed5CollegePlusNum
## 1:               12119             4204          37166             10291
## 2:               40579            14310         146989             46075
## 3:                6486             4901          18173              2220
## 4:                7471             2650          15780              1813
## 5:               13489             7861          39627              5010
## 6:                2817             1760           7104               945
##    TotalOccHU ForeignBornAsiaPct Ed4AssocDegreeNum ForeignBornEuropeNum
## 1:      21115              0.884              2998                  185
## 2:      78622              0.559             13759                 1606
## 3:       9186              0.310              1279                   62
## 4:       6840              0.226               908                    8
## 5:      20600              0.241              4775                  167
## 6:       3609              0.222               377                   18
##    NonEnglishHHNum HH65PlusAloneNum OwnHomeNum FemaleHHNum TotalHH
## 1:             138             2226      15814        2405   21115
## 2:            1120            10093      57881        7485   78622
## 3:             135             1333       5640        1776    9186
## 4:              47              885       5135         934    6840
## 5:             301             2276      16197        1977   20600
## 6:              38              388       2726         883    3609
##    ForeignBornCentralSouthAmPct ForeignBornCentralSouthAmNum
## 1:                        0.636                          351
## 2:                        1.518                         3160
## 3:                        1.870                          482
## 4:                        1.057                          238
## 5:                        3.528                         2034
## 6:                        0.319                           33
##    ForeignBornCaribPct ForeignBornCaribNum ForeignBornAfricaNum
## 1:              0.1196                  66                    0
## 2:              0.3681                 766                  190
## 3:              0.0116                   3                    0
## 4:              0.1243                  28                    0
## 5:              0.1041                  60                   66
## 6:              0.0676                   7                   15
##    ForeignBornAsiaNum ForeignBornMexNum LandAreaSQMiles2010
## 1:                488               191                 594
## 2:               1164              1707                1590
## 3:                 80               335                 885
## 4:                 51                38                 623
## 5:                139              1805                 645
## 6:                 23                 0                 623
##    Age65AndOlderNum2010 TotalPop2010 Under18Num2010
## 1:                 6546        54571          14613
## 2:                30568       182265          41898
## 3:                 3909        27457           6015
## 4:                 2906        22915           5201
## 5:                 8439        57322          14106
## 6:                 1469        10914           2430
##    Net_International_Migration_2000_2010 NaturalChangeNum0010
## 1:                                  -4.5                 2404
## 2:                                2239.5                 4686
## 3:                                 530.5                  664
## 4:                                  68.0                  418
## 5:                                1061.5                 1606
## 6:                                 424.0                  371
##    NetMigrationNum0010 TotalPopEst2012 TotalPopEst2013 TotalPopEst2010
## 1:                5224           54954           54727           54773
## 2:               36984          190145          194885          183112
## 3:               -1393           27169           26937           27327
## 4:                1280           22667           22521           22870
## 5:                6268           57580           57619           57376
## 6:               -1058           10606           10549           10876
##    TotalPopEst2014 TotalPopEst2011 Net_International_Migration_2010_2019
## 1:           54893           55227                                   -16
## 2:          199183          186558                                  1307
## 3:           26755           27341                                    44
## 4:           22553           22745                                   120
## 5:           57526           57560                                   119
## 6:           10663           10675                                    76
##    NaturalChange1019 TotalPopEst2015 TotalPopEst2016 TotalPopEst2017
## 1:               720           54864           55243           55390
## 2:              1667          202939          207601          212521
## 3:              -237           26283           25806           25157
## 4:                 4           22566           22586           22550
## 5:               453           57526           57494           57787
## 6:                70           10400           10389           10176
##    NetMigration1019 TotalPopEst2018 TotalPopEstBase2010 UnempRate2019
## 1:              376           55533               54597           2.7
## 2:            38455          217855              182265           2.7
## 3:            -2404           24872               27455           3.8
## 4:             -480           22367               22915           3.1
## 5:               -3           57771               57322           2.7
## 6:             -845           10174               10911           3.6
##    UnempRate2018 UnempRate2017 UnempRate2016 UnempRate2015 UnempRate2014
## 1:           3.6           3.9           5.1           5.2           5.8
## 2:           3.6           4.1           5.3           5.5           6.1
## 3:           5.1           5.8           8.3           8.9          10.5
## 4:           3.9           4.4           6.4           6.6           7.2
## 5:           3.5           4.0           5.4           5.4           6.1
## 6:           4.6           4.9           6.8           7.9           8.8
##    UnempRate2010 UnempRate2007 PctEmpChange1019 PctEmpChange1819
## 1:           8.9           3.3             8.65             0.78
## 2:          10.0           3.1            26.03             3.12
## 3:          12.3           6.3            -8.33             2.83
## 4:          11.4           4.1             6.38             1.83
## 5:           9.8           3.2             9.77             1.88
## 6:          11.8           9.4             6.59             1.58
##    PctEmpChange0719 PctEmpChange0710 PctEmpAgriculture PctEmpMining
## 1:             7.98            -0.62             0.551        0.332
## 2:            18.20            -6.22             0.956        0.372
## 3:           -15.19            -7.49             4.404        0.115
## 4:            -0.15            -6.14             1.901        4.087
## 5:            -4.36           -12.88             1.485        0.796
## 6:            40.36            31.68             5.939        0.000
##    PctEmpConstruction PctEmpManufacturing PctEmpTrade PctEmpTrans
## 1:               6.00               12.95       11.65        6.98
## 2:               8.27                9.12       16.80        4.93
## 3:               6.25               23.89       13.41        6.92
## 4:               9.01               18.71       10.87        4.89
## 5:               9.46               16.41       15.36        6.86
## 6:               4.92               33.76        7.41        5.20
##    PctEmpInformation PctEmpFIRE PctEmpServices PctEmpGovt NumCivEmployed
## 1:             1.629       6.26           43.4      10.25          24124
## 2:             1.370       7.40           45.7       5.07          93379
## 3:             0.241       3.97           33.7       7.10           8720
## 4:             1.309       4.58           39.4       5.27           8099
## 5:             1.251       5.84           38.0       4.59          21346
## 6:             0.431       2.31           33.5       6.55           3940
##    UnempRate2011 NumCivLaborForce2011 NumEmployed2011 NumCivLaborForce2012
## 1:           8.4                25836           23677                25740
## 2:           9.0                85045           77418                84414
## 3:          11.5                 9849            8712                 9362
## 4:          10.5                 8933            7996                 8798
## 5:           8.7                25123           22939                24960
## 6:          11.6                 4833            4272                 4736
##    NumUnemployed2010 NumCivLaborForce2008 NumUnemployed2011 NumEmployed2010
## 1:              2282                24687              2159           23431
## 2:              8339                83223              7627           75120
## 3:              1262                10161              1137            8959
## 4:              1020                 8749               937            7914
## 5:              2446                26698              2184           22460
## 6:               585                 3634               561            4356
##    NumCivLaborForce2010 NumUnemployed2009 NumEmployed2009 NumCivLaborForce2009
## 1:                25713              2402           22301                24703
## 2:                83459              8048           74403                82451
## 3:                10221              1431            8572                10003
## 4:                 8934              1161            7581                 8742
## 5:                24906              2648           23832                26480
## 6:                 4941               583            3156                 3739
##    UnempRate2008 UnempRate2012 NumEmployed2008 UnempRate2009 NumUnemployed2008
## 1:           5.1           6.9           23420           9.7              1267
## 2:           4.6           7.5           79372           9.8              3851
## 3:           8.8          11.5            9267          14.3               894
## 4:           5.8           8.5            8241          13.3               508
## 5:           4.7           6.9           25453          10.0              1245
## 6:          10.5          10.4            3251          15.6               383
##    NumUnemployed2015 NumUnemployed2019 NumCivLaborforce2019 NumUnemployed2018
## 1:              1331               714                26172               935
## 2:              4862              2653                97328              3424
## 3:               765               324                 8537               427
## 4:               568               266                 8685               337
## 5:              1323               676                25331               868
## 6:               379               175                 4818               220
##    NumEmployed2018 NumCivLaborforce2018 NumUnemployed2017 NumEmployed2017
## 1:           25261                26196              1013           25062
## 2:           91809                95233              3745           88711
## 3:            7987                 8414               486            7863
## 4:            8268                 8605               375            8208
## 5:           24201                25069               998           23824
## 6:            4571                 4791               238            4614
##    NumCivLaborforce2017 NumUnemployed2016 NumEmployed2016 NumCivLaborforce2016
## 1:                26075              1322           24709                26031
## 2:                92456              4835           86060                90895
## 3:                 8349               700            7736                 8436
## 4:                 8583               556            8088                 8644
## 5:                24822              1326           23358                24684
## 6:                 4852               330            4514                 4844
##    NumCivLaborforce2013 NumEmployed2015 NumEmployed2012 NumUnemployed2014
## 1:                25810           24321           23961              1495
## 2:                85280           83010           78065              5301
## 3:                 9099            7860            8283               932
## 4:                 8705            8021            8047               617
## 5:                24887           23198           23244              1504
## 6:                 4778            4406            4245               418
##    NumEmployed2014 NumCivLaborforce2014 UnempRate2013 NumUnemployed2013
## 1:           24097                25592           6.2              1605
## 2:           81083                86384           6.6              5654
## 3:            7913                 8845          10.2               931
## 4:            7942                 8559           7.9               689
## 5:           23023                24527           6.3              1562
## 6:            4314                 4732           9.4               447
##    NumEmployed2019 NumEmployed2013 NumUnemployed2007 NumEmployed2007
## 1:           25458           24205               806           23577
## 2:           94675           79626              2560           80099
## 3:            8213            8168               650            9684
## 4:            8419            8016               359            8432
## 5:           24655           23325               849           25780
## 6:            4643            4331               345            3308
##    NumCivLaborforce2007 NumUnemployed2012 NumCivLaborforce2015
## 1:                24383              1779                25652
## 2:                82659              6349                87872
## 3:                10334              1079                 8625
## 4:                 8791               751                 8589
## 5:                26629              1716                24521
## 6:                 3653               491                 4785
##    RuralUrbanContinuumCode2013 UrbanInfluenceCode2013
## 1:                           2                      2
## 2:                           3                      2
## 3:                           6                      6
## 4:                           1                      1
## 5:                           1                      1
## 6:                           6                      6
##    RuralUrbanContinuumCode2003 UrbanInfluenceCode2003 Metro2013 Nonmetro2013
## 1:                           2                      2         1            0
## 2:                           4                      5         1            0
## 3:                           6                      6         0            1
## 4:                           1                      1         1            0
## 5:                           1                      1         1            0
## 6:                           6                      6         0            1
##    Micropolitan2013 Type_2015_Update Type_2015_Farming_NO
## 1:                0                0                    0
## 2:                0                5                    0
## 3:                0                3                    0
## 4:                0                0                    0
## 5:                0                0                    0
## 6:                0                3                    0
##    Type_2015_Manufacturing_NO Type_2015_Mining_NO Type_2015_Government_NO
## 1:                          0                   0                       0
## 2:                          0                   0                       0
## 3:                          1                   0                       0
## 4:                          0                   0                       0
## 5:                          0                   0                       0
## 6:                          1                   0                       0
##    Type_2015_Recreation_NO Low_Education_2015_update Low_Employment_2015_update
## 1:                       0                         0                          0
## 2:                       1                         0                          0
## 3:                       0                         1                          1
## 4:                       0                         1                          1
## 5:                       0                         1                          1
## 6:                       0                         1                          1
##    Population_loss_2015_update Retirement_Destination_2015_Update
## 1:                           0                                  1
## 2:                           0                                  1
## 3:                           0                                  0
## 4:                           0                                  0
## 5:                           0                                  0
## 6:                           0                                  0
##    Perpov_1980_0711 PersistentChildPoverty_1980_2011 Hipov HiAmenity
## 1:                0                                0     0         0
## 2:                0                                0     0         1
## 3:                1                                1     1         0
## 4:                0                                1     0         0
## 5:                0                                0     0         0
## 6:                1                                1     1         0
##    HiCreativeClass2000 Gas_Change Oil_Change Oil_Gas_Change Metro2003
## 1:                   1          0          0              0         1
## 2:                   1          9          0              9         0
## 3:                   0          0          0              0         0
## 4:                   0          0          0              0         1
## 5:                   0          0          0              0         1
## 6:                   0          0          0              0         0
##    NonmetroNotAdj2003 NonmetroAdj2003 Noncore2003 EconomicDependence2000
## 1:                  0               0           0                      3
## 2:                  0               1           0                      5
## 3:                  0               1           1                      3
## 4:                  0               0           0                      6
## 5:                  0               0           0                      6
## 6:                  0               1           1                      1
##    Nonmetro2003 Micropolitan2003 FarmDependent2003 ManufacturingDependent2000
## 1:            0                0                 0                          1
## 2:            1                1                 0                          0
## 3:            1                0                 0                          1
## 4:            0                0                 0                          0
## 5:            0                0                 0                          0
## 6:            1                0                 1                          0
##    LowEducation2000 RetirementDestination2000 PersistentPoverty2000 Noncore2013
## 1:                0                         0                     0           0
## 2:                0                         1                     0           0
## 3:                1                         0                     1           1
## 4:                1                         1                     1           0
## 5:                0                         1                     0           0
## 6:                1                         0                     1           1
##    Type_2015_Nonspecialized_NO Metro_Adjacent2013 PersistentChildPoverty2004
## 1:                           1                  0                          0
## 2:                           0                  0                          0
## 3:                           0                  1                          1
## 4:                           1                  0                          1
## 5:                           1                  0                          0
## 6:                           0                  1                          1
##    RecreationDependent2000
## 1:                       0
## 2:                       1
## 3:                       0
## 4:                       0
## 5:                       0
## 6:                       0
  1. Select possible variables in county_data as covariates. We provide county_data_sub, a subset variables from county_data, for you to get started. Please add any potential variables as you wish.

    1. Report missing values in your final subset of variables.

    2. In the following anaylsis, you may ignore the missing values.

county_data_sub <- county_data %>%
  select(County, State, FIPS, Deep_Pov_All, PovertyAllAgesPct, PerCapitaInc, UnempRate2019, PctEmpFIRE, PctEmpConstruction, PctEmpTrans, PctEmpMining, PctEmpTrade, PctEmpInformation, PctEmpAgriculture, PctEmpManufacturing, PctEmpServices, PopDensity2010, OwnHomePct, Age65AndOlderPct2010, TotalPop25Plus, Under18Pct2010, Ed2HSDiplomaOnlyPct, Ed3SomeCollegePct, Ed4AssocDegreePct, Ed5CollegePlusPct, ForeignBornPct, Net_International_Migration_Rate_2010_2019, NetMigrationRate1019, NaturalChangeRate1019, TotalPopEst2019, WhiteNonHispanicPct2010, NativeAmericanNonHispanicPct2010, BlackNonHispanicPct2010, AsianNonHispanicPct2010, HispanicPct2010, Type_2015_Update, RuralUrbanContinuumCode2013, UrbanInfluenceCode2013, Perpov_1980_0711, HiCreativeClass2000, HiAmenity, Retirement_Destination_2015_Update)
library(leaps)
#using Log of total deaths per 100k, and using forward selection
fit.forward <- regsubsets(log_total_death_per100k ~ ., covid_factor , nvmax=25, method="forward")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 27 linear dependencies found
## Reordering variables and trying again:
f.f <- summary(fit.forward)
fit.forward.var <- f.f$which
colnames(fit.forward.var)[fit.forward.var[5,]]
## [1] "(Intercept)"         "StateKY"             "StateVT"            
## [4] "MultipleRacePct2010" "Ed5CollegePlusPct"   "FemaleHHPct"
#(Intercept), StateKY, StateVT, MultipleRacePct2010, Ed5CollegePlusPct, FemaleHHPct
fit <- lm(log_total_death_per100k ~ State + Ed5CollegePlusPct + FemaleHHPct, covid_factor)
summary(fit)
## 
## Call:
## lm(formula = log_total_death_per100k ~ State + Ed5CollegePlusPct + 
##     FemaleHHPct, data = covid_factor)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.656 -0.253  0.083  0.400  2.721 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.86584    0.12813   37.98  < 2e-16 ***
## StateAR            0.01516    0.13944    0.11  0.91345    
## StateAZ            0.28936    0.23677    1.22  0.22177    
## StateCA           -0.89151    0.14980   -5.95  3.0e-09 ***
## StateCO           -0.67131    0.14867   -4.52  6.6e-06 ***
## StateCT            0.34313    0.31184    1.10  0.27126    
## StateDC            0.32143    0.83762    0.38  0.70120    
## StateDE           -0.19134    0.48924   -0.39  0.69576    
## StateFL           -0.13936    0.14362   -0.97  0.33193    
## StateGA           -0.11282    0.12071   -0.93  0.35006    
## StateIA            0.35833    0.13437    2.67  0.00770 ** 
## StateID           -0.62016    0.16311   -3.80  0.00015 ***
## StateIL            0.23992    0.13173    1.82  0.06865 .  
## StateIN            0.00495    0.13437    0.04  0.97060    
## StateKS           -0.67912    0.13297   -5.11  3.5e-07 ***
## StateKY           -0.79609    0.12706   -6.27  4.2e-10 ***
## StateLA            0.09333    0.14490    0.64  0.51955    
## StateMA            0.03595    0.24698    0.15  0.88428    
## StateMD           -0.21210    0.19871   -1.07  0.28588    
## StateME           -1.46958    0.23209   -6.33  2.8e-10 ***
## StateMI           -0.10709    0.13792   -0.78  0.43752    
## StateMN           -0.22115    0.13777   -1.61  0.10855    
## StateMO           -0.37610    0.12905   -2.91  0.00359 ** 
## StateMS            0.11317    0.13743    0.82  0.41031    
## StateMT           -0.02417    0.15380   -0.16  0.87511    
## StateNC           -0.49066    0.13106   -3.74  0.00018 ***
## StateND            0.36006    0.15637    2.30  0.02137 *  
## StateNE           -0.60296    0.13693   -4.40  1.1e-05 ***
## StateNH           -0.84115    0.28290   -2.97  0.00297 ** 
## StateNJ            0.56256    0.20986    2.68  0.00739 ** 
## StateNM           -0.37707    0.17658   -2.14  0.03280 *  
## StateNV           -0.91057    0.22628   -4.02  5.9e-05 ***
## StateNY           -0.25719    0.14752   -1.74  0.08137 .  
## StateOH           -0.41665    0.13528   -3.08  0.00209 ** 
## StateOK           -0.46489    0.13921   -3.34  0.00085 ***
## StateOR           -1.17559    0.17266   -6.81  1.2e-11 ***
## StatePA            0.21858    0.14470    1.51  0.13100    
## StateRI           -0.01702    0.38616   -0.04  0.96486    
## StateSC           -0.09686    0.15887   -0.61  0.54211    
## StateSD            0.47093    0.14554    3.24  0.00123 ** 
## StateTN            0.03739    0.13278    0.28  0.77828    
## StateTX            0.09653    0.11443    0.84  0.39896    
## StateUT           -1.20255    0.18672   -6.44  1.4e-10 ***
## StateVA           -0.56371    0.12533   -4.50  7.1e-06 ***
## StateVT           -2.23193    0.24585   -9.08  < 2e-16 ***
## StateWA           -1.18192    0.16862   -7.01  2.9e-12 ***
## StateWI           -0.12905    0.14332   -0.90  0.36794    
## StateWV           -0.66306    0.15187   -4.37  1.3e-05 ***
## StateWY           -0.22907    0.20306   -1.13  0.25935    
## Ed5CollegePlusPct -0.01579    0.00181   -8.74  < 2e-16 ***
## FemaleHHPct        0.04162    0.00448    9.29  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.828 on 3057 degrees of freedom
## Multiple R-squared:  0.292,  Adjusted R-squared:  0.28 
## F-statistic: 25.2 on 50 and 3057 DF,  p-value: <2e-16
covid_factor[which(is.na(covid_factor$Ed5CollegePlusPct))]
## Empty data.table (0 rows and 212 cols): FIPS,cum_cases,cum_deaths,week,log_total_death_per100k,State...
covid_factor[which(is.na(covid_factor$State))]
## Empty data.table (0 rows and 212 cols): FIPS,cum_cases,cum_deaths,week,log_total_death_per100k,State...
covid_factor[which(is.na(covid_factor$FemaleHHPct))]
## Empty data.table (0 rows and 212 cols): FIPS,cum_cases,cum_deaths,week,log_total_death_per100k,State...
#There are no missing values in our final selection of variables
  1. Use LASSO to choose a parsimonious model with all available sensible county-level information. Force in State in the process. Why we need to force in State?

  2. Use Cp or BIC to fine tune the LASSO model from iii). Again force in State in the process.

  3. If necessary, reduce the model from iv) to a final model with all variables being siginificant at 0.05 level. Are the linear model assumptions all reasonably met?

  4. It has been shown that COVID affects elderly the most. It is also claimed that the COVID death rate among African Americans and Latinxs is higher. Does your analysis support these arguments?

  5. Based on your final model, summarize your findings. In particular, summarize the state effect controlling for others. Provide intervention recommendations to policy makers to reduce COVID death rate.

  6. What else can we do to improve our model? What other important information we may have missed?

Appendix 1: Data description

A detailed summary of the variables in each data set follows:

Infection and fatality data

  • date: Date
  • county: County name
  • state: State name
  • fips: County code that uniquely identifies a county
  • cases: Number of cumulative COVID-19 infections
  • deaths: Number of cumulative COVID-19 deaths

Socioeconomic demographics

Income: Poverty level and household income

  • PovertyUnder18Pct: Poverty rate for children age 0-17, 2018
  • Deep_Pov_All: Deep poverty, 2014-18
  • Deep_Pov_Children: Deep poverty for children, 2014-18
  • PovertyAllAgesPct: Poverty rate, 2018
  • MedHHInc: Median household income, 2018 (In 2018 dollars)
  • PerCapitaInc: Per capita income in the past 12 months (In 2018 inflation adjusted dollars), 2014-18
  • PovertyAllAgesNum: Number of people of all ages in poverty, 2018
  • PovertyUnder18Num: Number of people age 0-17 in poverty, 2018

Jobs: Employment type, rate, and change

  • UnempRate2007-2019: Unemployment rate, 2007-2019

  • NumEmployed2007-2019: Employed, 2007-2019

  • NumUnemployed2007-2019: Unemployed, 2007-2019

  • PctEmpChange1019: Percent employment change, 2010-19

  • PctEmpChange1819: Percent employment change, 2018-19

  • PctEmpChange0719: Percent employment change, 2007-19

  • PctEmpChange0710: Percent employment change, 2007-10

  • NumCivEmployed: Civilian employed population 16 years and over, 2014-18

  • NumCivLaborforce2007-2019: Civilian labor force, 2007-2019

  • PctEmpFIRE: Percent of the civilian labor force 16 and over employed in finance and insurance, and real estate and rental and leasing, 2014-18

  • PctEmpConstruction: Percent of the civilian labor force 16 and over employed in construction, 2014-18

  • PctEmpTrans: Percent of the civilian labor force 16 and over employed in transportation, warehousing and utilities, 2014-18

  • PctEmpMining: Percent of the civilian labor force 16 and over employed in mining, quarrying, oil and gas extraction, 2014-18

  • PctEmpTrade: Percent of the civilian labor force 16 and over employed in wholesale and retail trade, 2014-18

  • PctEmpInformation: Percent of the civilian labor force 16 and over employed in information services, 2014-18

  • PctEmpAgriculture: Percent of the civilian labor force 16 and over employed in agriculture, forestry, fishing, and hunting, 2014-18

  • PctEmpManufacturing: Percent of the civilian labor force 16 and over employed in manufacturing, 2014-18

  • PctEmpServices: Percent of the civilian labor force 16 and over employed in services, 2014-18

  • PctEmpGovt: Percent of the civilian labor force 16 and over employed in public administration, 2014-18

People: Population size, density, education level, race, age, household size, and migration rates

  • PopDensity2010: Population density, 2010

  • LandAreaSQMiles2010: Land area in square miles, 2010

  • TotalHH: Total number of households, 2014-18

  • TotalOccHU: Total number of occupied housing units, 2014-18

  • AvgHHSize: Average household size, 2014-18

  • OwnHomeNum: Number of owner occupied housing units, 2014-18

  • OwnHomePct: Percent of owner occupied housing units, 2014-18

  • NonEnglishHHPct: Percent of non-English speaking households of total households, 2014-18

  • HH65PlusAlonePct: Percent of persons 65 or older living alone, 2014-18

  • FemaleHHPct: Percent of female headed family households of total households, 2014-18

  • FemaleHHNum: Number of female headed family households, 2014-18

  • NonEnglishHHNum: Number of non-English speaking households, 2014-18

  • HH65PlusAloneNum: Number of persons 65 years or older living alone, 2014-18

  • Age65AndOlderPct2010: Percent of population 65 or older, 2010

  • Age65AndOlderNum2010: Population 65 years or older, 2010

  • TotalPop25Plus: Total population 25 and older, 2014-18 - 5-year average

  • Under18Pct2010: Percent of population under age 18, 2010

  • Under18Num2010: Population under age 18, 2010

  • Ed1LessThanHSPct: Percent of persons with no high school diploma or GED, adults 25 and over, 2014-18

  • Ed2HSDiplomaOnlyPct: Percent of persons with a high school diploma or GED only, adults 25 and over, 2014-18

  • Ed3SomeCollegePct: Percent of persons with some college experience, adults 25 and over, 2014-18

  • Ed4AssocDegreePct: Percent of persons with an associate’s degree, adults 25 and over, 2014-18

  • Ed5CollegePlusPct: Percent of persons with a 4-year college degree or more, adults 25 and over, 2014-18

  • Ed1LessThanHSNum: No high school, adults 25 and over, 2014-18

  • Ed2HSDiplomaOnlyNum: High school only, adults 25 and over, 2014-18

  • Ed3SomeCollegeNum: Some college experience, adults 25 and over, 2014-18

  • Ed4AssocDegreeNum: Number of persons with an associate’s degree, adults 25 and over, 2014-18

  • Ed5CollegePlusNum: College degree 4-years or more, adults 25 and over, 2014-18

  • ForeignBornPct: Percent of total population foreign born, 2014-18

  • ForeignBornEuropePct: Percent of persons born in Europe, 2014-18

  • ForeignBornMexPct: Percent of persons born in Mexico, 2014-18

  • ForeignBornCentralSouthAmPct: Percent of persons born in Central or South America, 2014-18

  • ForeignBornAsiaPct: Percent of persons born in Asia, 2014-18

  • ForeignBornCaribPct: Percent of persons born in the Caribbean, 2014-18

  • ForeignBornAfricaPct: Percent of persons born in Africa, 2014-18

  • ForeignBornNum: Number of people foreign born, 2014-18

  • ForeignBornCentralSouthAmNum: Number of persons born in Central or South America, 2014-18

  • ForeignBornEuropeNum: Number of persons born in Europe, 2014-18

  • ForeignBornMexNum: Number of persons born in Mexico, 2014-18

  • ForeignBornAfricaNum: Number of persons born in Africa, 2014-18

  • ForeignBornAsiaNum: Number of persons born in Asia, 2014-18

  • ForeignBornCaribNum: Number of persons born in the Caribbean, 2014-18

  • Net_International_Migration_Rate_2010_2019: Net international migration rate, 2010-19

  • Net_International_Migration_2010_2019: Net international migration, 2010-19

  • Net_International_Migration_2000_2010: Net international migration, 2000-10

  • Immigration_Rate_2000_2010: Net international migration rate, 2000-10

  • NetMigrationRate0010: Net migration rate, 2000-10

  • NetMigrationRate1019: Net migration rate, 2010-19

  • NetMigrationNum0010: Net migration, 2000-10

  • NetMigration1019: Net Migration, 2010-19

  • NaturalChangeRate1019: Natural population change rate, 2010-19

  • NaturalChangeRate0010: Natural population change rate, 2000-10

  • NaturalChangeNum0010: Natural change, 2000-10

  • NaturalChange1019: Natural population change, 2010-19

  • TotalPop2010: Population size 4/1/2010 Census

  • TotalPopEst2010: Population size 7/1/2010

  • TotalPopEst2011: Population size 7/1/2011

  • TotalPopEst2012: Population size 7/1/2012

  • TotalPopEst2013: Population size 7/1/2013

  • TotalPopEst2014: Population size 7/1/2014

  • TotalPopEst2015: Population size 7/1/2015

  • TotalPopEst2016: Population size 7/1/2016

  • TotalPopEst2017: Population size 7/1/2017

  • TotalPopEst2018: Population size 7/1/2018

  • TotalPopEst2019: Population size 7/1/2019

  • TotalPopACS: Total population, 2014-18 - 5-year average

  • TotalPopEstBase2010: County Population estimate base 4/1/2010

  • NonHispanicAsianPopChangeRate0010: Population change rate Non-Hispanic Asian, 2000-10

  • PopChangeRate1819: Population change rate, 2018-19

  • PopChangeRate1019: Population change rate, 2010-19

  • PopChangeRate0010: Population change rate, 2000-10

  • NonHispanicNativeAmericanPopChangeRate0010: Population change rate Non-Hispanic Native American, 2000-10

  • HispanicPopChangeRate0010: Population change rate Hispanic, 2000-10

  • MultipleRacePopChangeRate0010: Population change rate multiple race, 2000-10

  • NonHispanicWhitePopChangeRate0010: Population change rate Non-Hispanic White, 2000-10

  • NonHispanicBlackPopChangeRate0010: Population change rate Non-Hispanic African American, 2000-10

  • MultipleRacePct2010: Percent multiple race, 2010

  • WhiteNonHispanicPct2010: Percent Non-Hispanic White, 2010

  • NativeAmericanNonHispanicPct2010: Percent Non-Hispanic Native American, 2010

  • BlackNonHispanicPct2010: Percent Non-Hispanic African American, 2010

  • AsianNonHispanicPct2010: Percent Non-Hispanic Asian, 2010

  • HispanicPct2010: Percent Hispanic, 2010

  • MultipleRaceNum2010: Population size multiple race, 2010

  • WhiteNonHispanicNum2010: Population size Non-Hispanic White, 2010

  • BlackNonHispanicNum2010: Population size Non-Hispanic African American, 2010

  • NativeAmericanNonHispanicNum2010: Population size Non-Hispanic Native American, 2010

  • AsianNonHispanicNum2010: Population size Non-Hispanic Asian, 2010

  • HispanicNum2010: Population size Hispanic, 2010

County classifications: Type of county (rural or urban on a rural-urban continuum scale)

  • Type_2015_Recreation_NO: Recreation counties, 2015 edition

  • Type_2015_Farming_NO: Farming-dependent counties, 2015 edition

  • Type_2015_Mining_NO: Mining-dependent counties, 2015 edition

  • Type_2015_Government_NO: Federal/State government-dependent counties, 2015 edition

  • Type_2015_Update: County typology economic types, 2015 edition

  • Type_2015_Manufacturing_NO: Manufacturing-dependent counties, 2015 edition

  • Type_2015_Nonspecialized_NO: Nonspecialized counties, 2015 edition

  • RecreationDependent2000: Nonmetro recreation-dependent, 1997-00

  • ManufacturingDependent2000: Manufacturing-dependent, 1998-00

  • FarmDependent2003: Farm-dependent, 1998-00

  • EconomicDependence2000: Economic dependence, 1998-00

  • RuralUrbanContinuumCode2003: Rural-urban continuum code, 2003

  • UrbanInfluenceCode2003: Urban influence code, 2003

  • RuralUrbanContinuumCode2013: Rural-urban continuum code, 2013

  • UrbanInfluenceCode2013: Urban influence code, 2013

  • Noncore2013: Nonmetro noncore, outside Micropolitan and Metropolitan, 2013

  • Micropolitan2013: Micropolitan, 2013

  • Nonmetro2013: Nonmetro, 2013

  • Metro2013: Metro, 2013

  • Metro_Adjacent2013: Nonmetro, adjacent to metro area, 2013

  • Noncore2003: Nonmetro noncore, outside Micropolitan and Metropolitan, 2003

  • Micropolitan2003: Micropolitan, 2003

  • Metro2003: Metro, 2003

  • Nonmetro2003: Nonmetro, 2003

  • NonmetroNotAdj2003: Nonmetro, nonadjacent to metro area, 2003

  • NonmetroAdj2003: Nonmetro, adjacent to metro area, 2003

  • Oil_Gas_Change: Change in the value of onshore oil and natural gas production, 2000-11

  • Gas_Change: Change in the value of onshore natural gas production, 2000-11

  • Oil_Change: Change in the value of onshore oil production, 2000-11

  • Hipov: High poverty counties, 2014-18

  • Perpov_1980_0711: Persistent poverty counties, 2015 edition

  • PersistentChildPoverty_1980_2011: Persistent child poverty counties, 2015 edition

  • PersistentChildPoverty2004: Persistent child poverty counties, 2004

  • PersistentPoverty2000: Persistent poverty counties, 2004

  • Low_Education_2015_update: Low education counties, 2015 edition

  • LowEducation2000: Low education, 2000

  • HiCreativeClass2000: Creative class, 2000

  • HiAmenity: High natural amenities

  • RetirementDestination2000: Retirement destination, 1990-00

  • Low_Employment_2015_update: Low employment counties, 2015 edition

  • Population_loss_2015_update: Population loss counties, 2015 edition

  • Retirement_Destination_2015_Update: Retirement destination counties, 2015 edition

Appendix 2: Data cleaning

The raw data sets are dirty and need transforming before we can do our EDA. It takes time and efforts to clean and merge different data sources so we provide the final output of the cleaned and merged data. The cleaning procedure is as follows. Please read through to understand what is in the cleaned data. We set eval = data_cleaned in the following cleaning chunks so that these cleaning chunks will only run if any of data/covid_county.csv, data/covid_rates.csv or data/covid_intervention.csv does not exist.

# Indicator to check whether the data files exist
data_cleaned <- !(file.exists("data/covid_county.csv") & 
                    file.exists("data/covid_rates.csv") & 
                    file.exists("data/covid_intervention.csv"))

We first read in the table using data.table::fread(), as we did last time.

# COVID case/mortality rate data
covid_rates <- fread("data/us_counties.csv", na.strings = c("NA", "", "."))
nyc <- fread("data/nycdata.csv", na.strings = c("NA", "", "."))

# Socioeconomic data
income <- fread("data/income.csv", na.strings = c("NA", "", "."))
jobs <- fread("data/jobs.csv", na.strings = c("NA", "", "."))
people <- fread("data/people.csv", na.strings = c("NA", "", "."))
county_class <- fread("data/county_classifications.csv", na.strings = c("NA", "", "."))

# Internvention policy data
int_dates <- fread("data/intervention_dates.csv", na.strings = c("NA", "", "."))

4.1 Clean NYC data

The original NYC data contains more information than we need. We extract only the number of cases and deaths and format it the same as the covid_rates data.

# NYC county fips matching table 
nyc_fips <- data.table(FIPS = c('36005', '36047', '36061', '36081', '36085'),
                       County = c("BX", "BK", "MN", "QN", "SI"))

# nyc case
nyc_case <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
                   BX = BX_CASE_COUNT, 
                   BK = BK_CASE_COUNT, 
                   MN = MN_CASE_COUNT, 
                   QN = QN_CASE_COUNT, 
                   SI = SI_CASE_COUNT)]

nyc_case %<>% 
  pivot_longer(cols = BX:SI, 
               names_to = "County",
               values_to = "cases") %>%
  arrange(date) %>%
  group_by(County) %>%
  mutate(cum_cases = cumsum(cases))

# nyc death
nyc_death <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
                   BX = BX_DEATH_COUNT, 
                   BK = BK_DEATH_COUNT, 
                   MN = MN_DEATH_COUNT, 
                   QN = QN_DEATH_COUNT, 
                   SI = SI_DEATH_COUNT)]

nyc_death %<>% 
  pivot_longer(cols = BX:SI, 
               names_to = "County",
               values_to = "deaths") %>%
  arrange(date) %>%
  group_by(County) %>%
  mutate(cum_deaths = cumsum(deaths))

nyc_rates <- merge(nyc_case, 
                   nyc_death,
                   by = c("date", "County"),
                   all.x= T)

nyc_rates <- merge(nyc_rates,
                   nyc_fips,
                   by = "County")

nyc_rates$State <- "NY"
nyc_rates %<>% 
  select(date, FIPS, County, State, cum_cases, cum_deaths) %>%
  arrange(FIPS, date)

4.2 Continental US cases

We only consider cases and death in continental US. Alaska, Hawaii, and Puerto Rico have 02, 15, and 72 as their respective first 2 digits of their FIPS. We use the %/% operator for integer division to get the first 2 digits of FIPS. All data of counties in NYC are aggregated as County == "New York City" in covid_rates with no FIPS, so we combine the NYC data into covid_rate.

covid_rates <- covid_rates %>% 
  arrange(fips, date) %>%
  filter(!(fips %/% 1000 %in% c(2, 15, 72))) %>%
  filter(county != "New York City") %>%
  rename(FIPS = "fips",
         County = "county",
         State = "state",
         cum_cases = "cases", 
         cum_deaths = "deaths")


covid_rates$date <- as.Date(covid_rates$date)

covid_rates <- rbind(covid_rates, 
                     nyc_rates)

4.3 COVID date to week

We set the week of Jan 21, 2020 (the first case of COVID case in US) as the first week (2020-01-19 to 2020-01-25).

covid_rates[, week := (interval("2020-01-19", date) %/% weeks(1)) + 1]

4.4 COVID infection/mortality rates

Merge the TotalPopEst2019 variable from the demographic data with covid_rates by FIPS and calculate the cumulative infection/mortality rates for each county (divide the cumulative cases/deaths by county population). FIPS for Kansas city, Missouri, Rhode Island and some others are missing. We drop them for the moment and output the data up to week 57 as covid_rates.csv.

covid_rates <- merge(covid_rates[!is.na(FIPS)], 
                     people[,.(FIPS = as.character(FIPS),
                                   TotalPopEst2019)],
                     by = "FIPS",  
                     all.x = TRUE)

covid_rates %<>% 
  mutate(cum_infection_rate = cum_cases/TotalPopEst2019,
         cum_mortality_rate = cum_deaths/TotalPopEst2019)

fwrite(covid_rates %>% 
         filter(week < 58) %>% 
         arrange(FIPS, date), 
       "data/covid_rates.csv")

4.5 Formatting date in int_dates

We convert the columns representing dates in int_dates to R Date types using as.Date(). We will need to specify that the origin parameter is "0001-01-01". We output the data as covid_intervention.csv.

int_dates <- int_dates[-1,]
date_cols <- names(int_dates)[-(1:3)]
int_dates[, (date_cols) := lapply(.SD, as.Date, origin = "0001-01-01"), 
          .SDcols = date_cols]

fwrite(int_dates, "data/covid_intervention.csv")

4.6 NA in COVID data

NA values in the covid_rates data set correspond to a county not having confirmed cases/deaths. We replace the NA values in these columns with zeros.

covid_rates$cum_cases[is.na(covid_rates$cum_cases)] <- 0
covid_rates$cum_deaths[is.na(covid_rates$cum_deaths)] <- 0

4.7 Merge demographic data

Merge the demographic data sets by FIPS and output as covid_county.csv.

countydata <-
  merge(x = income,
        y = merge(
          x = people,
          y = jobs,
          by = c("FIPS", "State", "County")),
        by = c("FIPS", "State", "County"),
        all = TRUE)


countydata <- 
  merge(
    x = countydata,
    y = county_class %>% rename(FIPS = FIPStxt), 
    by = c("FIPS", "State", "County"),
    all = TRUE
  )

# Check dimensions
# They are now 3279 x 208
dim(countydata)
fwrite(countydata, "data/covid_county.csv")